In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from collections import Counter
from sklearn.preprocessing import LabelEncoder
data = pd.read_csv("VIEW_-_Grand_theft_auto_2004-2014.csv")
In [2]:
data.columns
Out[2]:
Aha, so here are all of the features of the dataset. This is important to note for modeling, visualization, and any further analysis. Some features are numeric (e.g. longitude, latitude, ...) while others are categorical (e.g. city names, gang_related, ...).
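As a quick way to separate the two kinds of features, pandas can report each column's dtype. A minimal sketch on a synthetic frame (the column names mirror the dataset, but the values here are made up since the CSV is not reproduced inline):

```python
import pandas as pd

# Small synthetic frame standing in for the theft dataset;
# the values below are illustrative, not real records.
df = pd.DataFrame({
    "CITY": ["Fresno", "Oakland", "Fresno"],
    "LONGITUDE": [-119.77, -122.27, -119.79],
    "CRIME_YEAR": [2004, 2010, 2014],
})

# dtypes shows numeric vs. object (string/categorical) columns at a glance.
print(df.dtypes)

# select_dtypes splits the frame programmatically.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(numeric_cols)
print(categorical_cols)
```

The same split drives later choices: numeric columns can be plotted directly, while categorical ones need counting or encoding first.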
In [3]:
len(data)
Out[3]:
We furthermore see that this is quite a large dataset, with nearly 160k records.
In [4]:
cityThefts = Counter(data['CITY']).most_common(20)
cities = [i[0] for i in cityThefts]
thefts = [i[1] for i in cityThefts]
print(cities)
print(thefts)
In [5]:
plt.figure(figsize=(12,8))
plt.bar(range(len(thefts)),thefts)
plt.xticks(range(len(cities)),cities,rotation=68)
plt.title("Total GTA count per city from 2004-2014")
plt.xlabel("Top 20 Cities")
plt.ylabel("Thefts")
plt.show()
It is important to note that the chart above shows only raw counts for the top 20 cities. It does not adjust for population, so to some extent it reads as a population chart rather than a theft-rate statistic. If each city's population were available for the 2004-2014 interval, we could instead chart the theft rate per city, adjusted for population.
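The population adjustment described above is straightforward once population figures exist. A sketch, with made-up counts and hypothetical population numbers purely for illustration:

```python
# Hypothetical inputs: raw theft counts and per-city population figures.
# Neither set of numbers comes from the dataset; they are illustrative only.
thefts = {"Fresno": 12000, "Oakland": 9000}
population = {"Fresno": 500000, "Oakland": 400000}

# Thefts per 1,000 residents gives a fairer basis for cross-city comparison
# than raw counts, which largely track city size.
rate_per_1000 = {
    city: 1000.0 * count / population[city]
    for city, count in thefts.items()
}
print(rate_per_1000)
```

With rates in hand, the same bar-chart code would work unchanged, just with `rate_per_1000` values in place of raw counts.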
In [6]:
data2014 = data[data.CRIME_YEAR == 2014]
cityThefts2014 = Counter(data2014['CITY']).most_common(20)
cities2014 = [i[0] for i in cityThefts2014]
thefts2014 = [i[1] for i in cityThefts2014]
In [7]:
plt.figure(figsize=(18,6))
plt.subplot(1,2,1)
plt.bar(range(len(thefts2014)),thefts2014)
plt.xticks(range(len(cities2014)),cities2014,rotation=68)
plt.title("Total GTA count per city in 2014")
plt.xlabel("Top 20 Cities")
plt.ylabel("Thefts")
plt.subplot(1,2,2)
plt.bar(range(len(thefts)),thefts)
plt.xticks(range(len(cities)),cities,rotation=68)
plt.title("Total GTA count per city from 2004-2014")
plt.xlabel("Top 20 Cities")
plt.ylabel("Thefts")
plt.show()
The above charts would be better visualized if adjusted for population.
Let's look at the features again and see how many unique values each feature takes, to get a sense of how complex the dataset truly is.
In [8]:
data.columns
Out[8]:
In [26]:
uniqueLists = []
for i in data.columns:
    uniqueLists.append(np.unique(data[i], return_counts=True))
The code above relies on the `return_counts` parameter added to `np.unique` in numpy 1.9, which returns two arrays: the unique values and the number of times each occurs. I only recently updated my numpy package, which is why the earlier Counter-based code remains in use.
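A toy example of `np.unique` with `return_counts` on a small array (requires numpy >= 1.9):

```python
import numpy as np

# A few repeated city names standing in for a real column.
cities = np.array(["Fresno", "Oakland", "Fresno", "Fresno"])

# values holds the sorted unique entries; counts holds their frequencies,
# aligned element-for-element with values.
values, counts = np.unique(cities, return_counts=True)
print(values)
print(counts)
```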
In [32]:
for i in range(len(uniqueLists)):
    print(data.columns[i], len(uniqueLists[i][0]))
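A shorter, pandas-native route to the same per-column unique counts is `DataFrame.nunique`; a sketch on synthetic data, since the CSV itself is not included here:

```python
import pandas as pd

# Illustrative stand-in for the theft dataset.
df = pd.DataFrame({
    "CITY": ["Fresno", "Oakland", "Fresno"],
    "CRIME_YEAR": [2004, 2010, 2014],
})

# nunique counts distinct values in every column in a single call.
unique_counts = df.nunique()
print(unique_counts)
```

This avoids the explicit loop over columns entirely, at the cost of not also returning the per-value frequencies that `np.unique(..., return_counts=True)` provides.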
In [ ]: